Problem #1

Regression

Hypothesize a linear regression relationship:

We build on the previous hypothesis that distance can be used to predict the fare of a route by adding the average number of passengers who fly the route per day. We expect more popular flights to be cheaper than those with low flight traffic. We also felt that the best third explanatory variable for this analysis was the interaction between distance and passengers; the other variables in the provided dataset did not seem to fit as well with the two already included. Given the differences in value ranges between the variables and the outcome, we chose to use the log of each value.

\[ \underbrace{Y_i}_\text{lfare} \underbrace{=}_{\sim} \overbrace{\beta_0}^{\stackrel{\text{y-int}}{\text{base lfare}}} + \overbrace{\beta_1}^{\stackrel{\text{slope}}{\text{baseline}}} \underbrace{X_{1i}}_\text{ldist} + \overbrace{\beta_2}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{2i}}_\text{lpassen} + \overbrace{\beta_3}^{\stackrel{\text{change in}}{\text{slope}}} \underbrace{X_{1i}X_{2i}}_\text{ldist:lpassen} + \overbrace{\beta_4}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{4i}}_\text{LargeShare} + \overbrace{\beta_5}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{5i}}_\text{y00} + \overbrace{\beta_6}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{6i}}_\text{y97} + \epsilon_i \]
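The log transforms can be constructed before fitting. A minimal sketch, assuming the raw columns of IO_airfare are named fare, dist, and passen (the raw column names are an assumption; toy rows stand in for the real data):

```r
library(dplyr)

# Toy rows standing in for IO_airfare; the values are illustrative only
IO_airfare <- data.frame(fare = c(150, 220), dist = c(500, 1200), passen = c(80, 40))

# Take the log of each variable to put fare, distance, and passengers on comparable scales
IO_airfare <- IO_airfare %>%
  mutate(lfare   = log(fare),
         ldist   = log(dist),
         lpassen = log(passen))
```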

Best Multiple Regression results

Below are the multiple regression results using distance, passengers, their interaction, and a few additional variables.

lm.mult <-lm(lfare ~ ldist + lpassen + ldist:lpassen + LargeShare + y00 + y97, data=IO_airfare)
summary(lm.mult) %>%
pander(caption= "HW 4 Multiple regression results w/ extra estimators")
              Estimate   Std. Error   t value    Pr(>|t|)
(Intercept)     7.596     0.3353       22.65     8.764e-108
ldist          -0.3199    0.05027      -6.363    2.168e-10
lpassen        -0.8653    0.0536      -16.15     4.459e-57
LargeShare      0.0643    0.01249       5.149    2.729e-07
y00             0.0662    0.01184       5.59     2.406e-08
y97            -0.02985   0.01184      -2.522    0.01171
ldist:lpassen   0.1199    0.008095     14.81     1.618e-48

HW 4 Multiple regression results w/ extra estimators

Observations   Residual Std. Error   \(R^2\)   Adjusted \(R^2\)
4595           0.3275                0.4376    0.4369

Below are the multiple regression results using distance and passengers, but without the extra variables.

lm.mult2 <-lm(lfare ~ ldist + lpassen + ldist:lpassen, data=IO_airfare)
summary(lm.mult2) %>%
pander(caption= "HW 4 Simple Multiple regression w/o extra estimators")
              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)     8.08      0.3288       24.57    2.306e-125
ldist          -0.3855    0.0496       -7.771   9.51e-15
lpassen        -0.9209    0.05336     -17.26    1.027e-64
ldist:lpassen   0.1277    0.00808      15.81    7.441e-55

HW 4 Simple Multiple regression w/o extra estimators

Observations   Residual Std. Error   \(R^2\)   Adjusted \(R^2\)
4595           0.3301                0.4282    0.4278
Confidence Intervals
confint(lm.mult2, level = 0.95) %>%
pander(caption= "HW 4 Estimators 95% Conf Int's")
HW 4 Estimators 95% Conf Int's

Assuming our decision to reject the null hypothesis is correct, intervals constructed this way would capture the true coefficient value of each estimator in 95% of repeated samples from our sample space. The most important result is that none of the ranges includes 0, so we can safely conclude at the 95% confidence level that none of these coefficients equals 0.
                2.5 %     97.5 %
(Intercept)     7.435     8.724
ldist          -0.4827   -0.2882
lpassen        -1.025    -0.8163
ldist:lpassen   0.1119    0.1436
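Each interval is simply the point estimate plus or minus about two standard errors. For ldist, using \(t_{0.975} \approx 1.96\) with these degrees of freedom:

\[ -0.3855 \pm 1.96 \times 0.0496 \approx (-0.483,\ -0.288) \]

which matches the ldist row of the table.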
HW 2 Simple Regression Results

Here are the results from the HW 2 regression: predicting fare using distance alone.

lm.sim <-lm(fare ~ dist, data=IO_airfare)
summary(lm.sim) %>%
pander(caption= "HW 2 simple regression results")
              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)   103.3      1.643        62.87     0
dist            0.07631  0.001412     54.05     0

HW 2 simple regression results

Observations   Residual Std. Error   \(R^2\)   Adjusted \(R^2\)
4595           58.55                 0.3888    0.3886
Completed Regression Equation

Here is the base equation for the regression without the extra variables:

\[ \underbrace{Y_i}_\text{lfare} \underbrace{=}_{\sim} \overbrace{\beta_0}^{\stackrel{\text{y-int}}{\text{base lfare}}} + \overbrace{\beta_1}^{\stackrel{\text{slope}}{\text{baseline}}} \underbrace{X_{1i}}_\text{ldist} + \overbrace{\beta_2}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{2i}}_\text{lpassen} + \overbrace{\beta_3}^{\stackrel{\text{change in}}{\text{slope}}} \underbrace{X_{1i}X_{2i}}_\text{ldist:lpassen} + \epsilon_i \]

Here is the same equation with the estimated coefficients filled in.

\[ \underbrace{Y_i}_\text{lfare} \underbrace{=}_{\sim} \overbrace{8.08}^{\stackrel{\text{y-int}}{\text{base lfare}}} + \overbrace{-0.3855}^{\stackrel{\text{slope}}{\text{baseline}}} \underbrace{X_{1i}}_\text{ldist} + \overbrace{-0.9209}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{2i}}_\text{lpassen} + \overbrace{0.1277}^{\stackrel{\text{change in}}{\text{slope}}} \underbrace{X_{1i}X_{2i}}_\text{ldist:lpassen} + \epsilon_i \]
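Because of the interaction term, the distance elasticity of fare is not a single number. Differentiating the fitted equation with respect to \(X_{1i}\) gives

\[ \frac{\partial\,\widehat{lfare}}{\partial\,ldist} = -0.3855 + 0.1277\,lpassen \]

so the effect of distance on fare depends on how busy the route is.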

Plot

## Note: car::scatter3d(fare ~ dist + passen, data = IO_airfare) gives a quick
## 3d scatterplot, but it should only be run in the console, not knit.
## Embedding a 3d scatterplot inside the html document is harder, so it is
## built with plotly instead.
# Build the regression surface for the 3d plot
library(plotly)    # interactive 3d scatterplot
library(reshape2)  # acast() to reshape predictions into a grid

# Graph resolution (more important for more complex shapes)
graph_reso <- 0.5

#Setup Axis
axis_x <- seq(min(IO_airfare$ldist), max(IO_airfare$ldist), by = graph_reso)
axis_y <- seq(min(IO_airfare$lpassen), max(IO_airfare$lpassen), by = graph_reso)

#Sample points
lmnew <- expand.grid(ldist = axis_x, lpassen = axis_y, KEEP.OUT.ATTRS=F)
lmnew$Z <- predict.lm(lm.mult2, newdata = lmnew)
lmnew <- acast(lmnew, lpassen ~ ldist, value.var = "Z") #y ~ x

#Create scatterplot
plot_ly(IO_airfare, 
        x = ~ldist, 
        y = ~lpassen, 
        z = ~lfare,
        text = rownames(IO_airfare), 
        type = "scatter3d", 
        mode = "markers", color=~lfare) %>%
  add_trace(z = lmnew,
            x = axis_x,
            y = axis_y,
            type = "surface")

Interpretation/Assumptions

Interpretation

Based on the multiple regression, the intercept of 8.08 is the predicted log fare when log distance and log passengers are both zero, a point well outside the observed data. Because every variable is logged, the coefficients are elasticities: each additional 1% of distance decreases the fare by about 0.3855% (at the baseline), and each additional 1% of average passengers decreases the fare by about 0.9209%. The interaction between distance and passengers has a small positive coefficient (0.1277), so the distance effect weakens on busier routes. The p-values for each of these terms are all extremely close to 0.
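As a worked example of this interpretation (the passenger level below is an illustrative assumption, not a value taken from the data), the distance elasticity implied by the coefficients at lpassen = 6 is:

```r
# Coefficients from the lm.mult2 table above
b_ldist <- -0.3855
b_inter <-  0.1277

# Distance elasticity of fare at an illustrative lpassen of 6
elasticity <- b_ldist + b_inter * 6

# Implied fare multiplier for a 10% increase in distance at that passenger level
fare_multiplier <- 1.10^elasticity
```

At that passenger level the elasticity turns positive, so longer routes are predicted to be slightly more expensive, all else equal.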

These relationships are best seen in the 3d plot. It is quickly apparent that the estimators have similarly weighted effects on the predicted values, as the points are spread evenly through the central area of the chart.

Assumptions

Assuming our sample is random, the following diagnostic plots help examine the residuals. The first primarily shows whether variance remains constant across the fitted values. The second (the Q-Q plot) shows minor signs of right skewness, likely because the regression fails to capture the base costs of a flight, leaving those values to \(\beta_0\). The third plot helps determine whether the order of the data matters; this is usually needed for time-sorted data, but since this set is sorted alphabetically by origin, we included it to see if any patterns presented themselves.

From these plots, the primary change from the non-log version is that the extremes now show significantly less variance, so confidence in predictions at the extremes may be greater.

par(mfrow=c(1,3))
plot(lm.mult2,which=1:2)
plot(lm.mult2$residuals)

Problem 2

library(readxl)  # read_excel()
X401ksubs <- read_excel("401ksubs.xls")
X401ksubsF <- X401ksubs %>% filter(fsize== 1)

X401ksubsF$fsize <- as.factor(X401ksubsF$fsize)

How many single-person households are there in the data set?

There are 2017 single-person households in the data set.
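The count comes straight from the filtered data frame; a sketch on toy rows (the real X401ksubs is read with read_excel above):

```r
library(dplyr)

# Toy stand-in for X401ksubs: three single-person households out of five
X401ksubs <- data.frame(fsize = c(1, 1, 3, 2, 1))

# Number of single-person households
n_single <- X401ksubs %>% filter(fsize == 1) %>% nrow()
```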

Using OLS estimate the following regression equation and interpret the results of the estimated equation

This tells us that for every one-dollar increase in income, net financial wealth increases by about $0.95. Also, for every additional year of age, net financial wealth increases by about $1,031 (both nettfa and inc are measured in thousands, so the age coefficient of 1.03 is in thousands of dollars).

Does the intercept term you estimated have any interesting meaning? Explain

The intercept implies that people are roughly $60,700 in debt from the moment they are born (nettfa is measured in 1,000s). This is because at the moment of birth both income and age are 0.

Conduct a t-test for each of the estimated parameters and interpret their meaning.

lm.mult3 <-lm(nettfa ~ inc + age, data=X401ksubs)
summary(lm.mult3)
## 
## Call:
## lm(formula = nettfa ~ inc + age, data = X401ksubs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -509.27  -18.71   -4.09   10.02 1464.74 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -60.69654    2.59633  -23.38   <2e-16 ***
## inc           0.95336    0.02528   37.72   <2e-16 ***
## age           1.03078    0.05912   17.43   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 58.31 on 9272 degrees of freedom
## Multiple R-squared:  0.1691, Adjusted R-squared:  0.1689 
## F-statistic: 943.2 on 2 and 9272 DF,  p-value: < 2.2e-16

When we conduct a t-test for each of the variables, we get large t values and p-values below 0.01 each time, so we can reject the null hypotheses that our betas are equal to zero.
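Each t value is simply the estimate divided by its standard error; for inc:

\[ t_{inc} = \frac{0.95336}{0.02528} \approx 37.7 \]

well beyond any conventional critical value.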

Run a simple regression of nettfa using just inc; does it significantly change the estimated coefficient for inc? Explain why this may be.

lm.mult3 <-lm(nettfa ~ inc, data=X401ksubs)
summary(lm.mult3)
## 
## Call:
## lm(formula = nettfa ~ inc, data = X401ksubs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -504.39  -18.10   -4.29    6.73 1475.04 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -20.17948    1.17643  -17.15   <2e-16 ***
## inc           0.99991    0.02554   39.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 59.26 on 9273 degrees of freedom
## Multiple R-squared:  0.1418, Adjusted R-squared:  0.1417 
## F-statistic:  1532 on 1 and 9273 DF,  p-value: < 2.2e-16

Yes, it increases the coefficient for income from 0.953 to 1.000. With age omitted, the income coefficient absorbs part of the effect of age, since age is correlated with both income and net financial wealth. We can no longer separate the predictions by age and must find the single slope that best fits the data as a whole from one common intercept.
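This is the standard omitted-variable pattern. If \(\tilde{\beta}_{inc}\) denotes the slope with age omitted, then

\[ \tilde{\beta}_{inc} = \beta_{inc} + \beta_{age}\,\tilde{\delta} \approx 0.953 + 1.031\,\tilde{\delta} = 1.000 \]

where \(\tilde{\delta}\) is the slope from regressing age on inc; the observed shift implies \(\tilde{\delta} \approx 0.045\).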